Predicting AirBNB Costs in New York City (Statistical Modelling)

Name(s) & ID(s) of Group Members:

  1. Mashal Khan - s3906303
  2. Natneal Gizw - s3897250

Table of Contents - Phase 2

Introduction

Phase 1 Summary

In the first phase of this two-phase project, we performed data preprocessing and data exploration on a dataset of New York Airbnb vacation rental prices. We compared price against other key factors to help tourists and property owners see how prices range across their area and the whole of New York City.

In Phase 1, the overall dataset was cleaned by removing outliers and dropping columns unnecessary or unsuitable for machine learning. The target feature was set to 'price_per_night'. Furthermore, several visualisations were plotted to help identify trends and correlations between the target feature and the other descriptive features across the dataset.

The findings revealed that the number of host listings, the room type and the borough (neighbourhood group) of a listing had a significant impact on price. Furthermore, we identified that many of our other variables, such as availability, total reviews and reviews per month, had no effect on price (our target variable). Overall, we found that the borough (neighbourhood group) tends to be the biggest indicator of the price of an Airbnb rental.

Report Overview

This report uses price_per_night from the Airbnb dataset as the target feature and predicts its value in relation to other suitable features, both numerical and categorical. The cleaned and preprocessed data will be fitted to a statistical model using the Python module sklearn, and visualisations of the results will be created. Moreover, factors and features of the model, such as the residual plots, will be analysed in relation to the suitability of the statistical models used to describe the chosen dataset.

Overview of Methodology

The dataset under investigation concerns New York City's AirBNB usage, measured in various metrics. The aim of the statistical modelling was to explore a topic of our choosing, with the requirement that the dataset support multiple linear regression analysis.

Multiple linear regression uses multiple explanatory (independent) variables against a single response variable. This is useful because it shows the overall strength of the relationship between the response variable and many independent variables taken together. Furthermore, we can also see how each independent variable affects the dependent variable on a case-by-case basis. Therefore, multiple linear regression represents the overall strength of the relationship between the response variable and multiple independent variables.
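
As a minimal sketch of this idea (using synthetic data rather than the Airbnb dataset, and assuming the statsmodels library that the OLS summaries in this report come from), a multiple regression fit might look like:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data standing in for the real listings: two explanatory
# variables and a response that depends on both, plus noise.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "beds": rng.integers(1, 5, 200),
    "reviews": rng.integers(0, 100, 200),
})
df["price_per_night"] = 40 + 25 * df["beds"] + 0.5 * df["reviews"] + rng.normal(0, 5, 200)

# One formula, several explanatory variables: this is the "multiple" part.
model = smf.ols("price_per_night ~ beds + reviews", data=df).fit()
print(model.params)    # one coefficient per explanatory variable
print(model.rsquared)  # overall strength of the combined fit
```

Each coefficient estimates the per-unit effect of its variable while the others are held fixed, which is the case-by-case view described above.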

Our plan was to include all the features of the dataset that significantly aid in predictive modelling. By one-hot-encoding our dataset, including the neighbourhood feature, we will end up with hundreds of features; we expect to use neighbourhoods with the lowest p values to increase the accuracy of our statistical modelling solution.

We then fit the data to an OLS regression model. We use a summary of the model to check for the association between our independent variables and the target. We then check the four assumptions of a multiple linear regression model: residual normality, constant residual variance, residual independence and a significant linear correlation.

To reduce our model, we sort the features of the regression model by p-value, remove the feature with the highest p-value and recalculate the regression results, repeating until we are left only with features whose p-value is lower than 0.05.

The categorical features of the preprocessed dataset will be converted to binary indicator columns, to comply with the requirements of the statsmodels functions.

Statistical Modelling

We will import all necessary libraries required for the statistical modelling and check the shape of the DataFrame to ensure we are within the 5000-row limit and have the required columns.

Model Overview

As we can see, we are going to be using the above DataFrame for the statistical modelling within the investigation. Now let's check whether the datatypes of the variables are as expected. All variable datatypes are as expected.

We will now display the variables we are going to be using in the Regression Model.

For Independent Variables:

For Dependent Variables:

One-Hot-Encoding

Let's make a new copy of the original DataFrame. We will use this new copy for one-hot encoding the categorical columns in the dataset.

We will first re-check that all the datatypes of the columns are correct and as expected. All numerical variables must be int or float, and all categorical variables must be in object format.

This looks correct, and the datatypes of the columns are as expected.

We will now use Python's join function, which allows us to concatenate strings. We will extract the column names and concatenate them together, with the "+" symbol separating them.
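
For illustration, building the formula string this way (with hypothetical column names standing in for the encoded ones) might look like:

```python
# Join encoded column names into a patsy-style OLS formula string.
# These column names are illustrative, not the dataset's exact ones.
cols = ["room_type_Private_room", "minimum_nights", "neighbourhood_group_Brooklyn"]
formula = "price_per_night ~ " + " + ".join(cols)
print(formula)
# → price_per_night ~ room_type_Private_room + minimum_nights + neighbourhood_group_Brooklyn
```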

We now have the data we want to one-hot encode. We will one-hot encode the data using the get_dummies() function, which automatically one-hot encodes multiple categorical variables.
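
A minimal sketch of this step, with illustrative stand-in values rather than the full dataset:

```python
import pandas as pd

# Two categorical columns become one binary indicator column per category.
df = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room"],
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Queens"],
})
encoded = pd.get_dummies(df, columns=["room_type", "neighbourhood_group"])
print(list(encoded.columns))
# 2 room types + 3 neighbourhood groups → 5 indicator columns
```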

We also had to make some modifications to the names of the columns so they could be fitted within our OLS regression model. We did this by removing spaces, apostrophes, dashes and dots, replacing them with blank values. This ensured that no errors would occur in the OLS regression model fit.
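
A sketch of that cleanup (the regex and example names are our illustration, not the report's exact code; a slash is also stripped here as an extra assumption, to handle values like "Entire home/apt"):

```python
import re

# Strip the characters that break patsy formulas: spaces, apostrophes,
# dashes, dots (and, as an assumption, slashes). Underscores are kept.
raw_cols = ["room_type_Entire home/apt",
            "neighbourhood_Bedford-Stuyvesant",
            "neighbourhood_St. Albans"]
clean_cols = [re.sub(r"[ '\-./]", "", c) for c in raw_cols]
print(clean_cols)
# → ['room_type_Entirehomeapt', 'neighbourhood_BedfordStuyvesant', 'neighbourhood_StAlbans']
```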

We will now extract and concatenate the columns of the encoded DataFrame and store them in the form_deps_encoded variable. We will then concatenate the form_deps_encoded variable with the name of our target variable (price_per_night).

OLS Regression Fit in Statistical Modelling

Our Take on One-Hot-Encoding Results for the Neighbourhood Column

We have observed above that one-hot encoding the Neighbourhood column results in hundreds of extra variables being produced. We recognise this and tried to approach it by clustering; however, this is not exactly possible given the nature of the dataset, as the Neighbourhood_Group variable is already a "cluster" of those respective neighbourhoods.

So we are left with two options: either drop the Neighbourhood column, as it can negatively affect our multiple regression model, or keep it if it provides us with a higher r-squared value.

To test this we will produce two OLS Regression Models, one without Neighbourhoods and one with Neighbourhoods to compare their r-squared values.

We have now gathered the formula and the DataFrame needed for the OLS regression. Using this formula and DataFrame, we will now fit an OLS (ordinary least squares) model to our encoded data.

OLS Model Without Neighbourhoods

First, we will fit the OLS model without the Neighbourhoods variable, to see what r-squared value is achieved.

For simplicity's sake, and to prevent the notebook from feeling cluttered, we will include the whole code within the same cell and display only the r-squared value.

OLS Model With Neighbourhoods

We will now include the Neighbourhood column, to see what r-squared value is achieved and how it compares to the model above.

Summary of OLS Regression Results

With Neighbourhoods vs Without Neighbourhoods:

The model containing the Neighbourhood variables achieved a higher r-squared value of 0.541, compared to 0.459 for the model without neighbourhoods. Thus, we will use only the OLS model with neighbourhoods (r-squared 0.541) for the rest of the investigation.

Analysis of OLS Regression Results

The OLS regression results show an R-squared of 0.541. This indicates that 54.1% of the variation in price (price_per_night) is explained by the independent variables displayed above. By looking at the p-values, we observe that the majority of them are highly significant, though there are a few insignificant variables at the 5% level.

Furthermore, the Prob (F-statistic) result displays a value of 0.00, indicating that the regression is meaningful.

Setting up Residuals

The above table shows the predictions of the hotel prices alongside the residuals, which signify the deviation of the actual price from the predicted price, based on the full regression model.

What is observed from the above graph is that there is a generally linear relationship between the actual and predicted hotel prices until the 150-dollar mark, where it begins to fade out and plateau for the other half of the graph. The greatest density of the data is in the 50-to-200-dollar region, where the majority of the predictions lie.

Checking conditions for Full Regression Model

Multiple linear regression depends on four conditions that we will check: residuals being normal, residuals having constant variability, residuals being independent, and a significant linear correlation.

Condition 1: Linearity

This scatterplot and residual plot indicate that there is no linear relationship present, because there is no horizontal banding of points. Furthermore, it is also evident that the residuals reveal outliers within the dataset, as some residuals sit far from the general pattern, such as at (50, 200) and (170, -200).

Condition 2: Constant Variability

There is a general upwards trend in the residuals of the predicted hotel prices, with the majority of the prices and residuals densely located below the 150-dollar mark. However, a general trend of constant variability can be observed, with prominent horizontal streaks at visibly equal intervals, potentially a consequence of the dataset's preprocessing.

Condition 3: Normality

While the normal probability plot of the residuals shows small deviations from the regression line, indicating minor irregularities, there are only a few outliers that skew the residual distribution. There does appear to be one substantially elongated tail; however, the distribution still remains relatively normal and centred at 0.
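
As a numeric complement to the probability plot, a Shapiro-Wilk test could be applied to the residuals; this is one common option, not the report's method, and the residuals below are synthetic for illustration (in practice this would run on model.resid):

```python
import numpy as np
from scipy import stats

# Synthetic residuals standing in for the model's residuals.
rng = np.random.default_rng(3)
residuals = rng.normal(loc=0, scale=25, size=500)

# Shapiro-Wilk: the null hypothesis is that the sample is normal.
stat, p = stats.shapiro(residuals)
print(f"Shapiro-Wilk W = {stat:.3f}, p = {p:.3f}")
# a p-value above 0.05 gives no evidence against residual normality
```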

The distribution of the residuals for the fully fitted regression model appears to be normally distributed, signifying that this model supports, to an extent, the normality assumption.

Condition 4: Statistical Independence

Since the times of observation were not included in our dataset and we have no spatial variables, we cannot plot a "time of observation vs. residual" plot. However, we are confident that the observations in the model are independent due to the method we used to fit our model. Our full dataset included all listings in New York City, and to test for randomness, we randomly sampled the data, which allows listings to be observed in different random places and makes our observations independent of each other.

Backwards Feature Selection

Credit for this code is referenced below.

Setting Up Residuals For Reduced Model

Below are the actual and predicted values from the reduced model, along with the residuals calculating their difference.

This scatter plot of the predicted prices against the residuals for the reduced regression model closely resembles the one for the full regression model, with a similar, yet slightly more linear and more defined pattern than the previous scatter plot.

Checking conditions for Reduced Regression Model

This histogram of the residuals of the reduced regression model closely resembles the one for the full regression model, with a similar tail to the first, making it slightly right-skewed. However, it maintains a roughly normal distribution that is similarly centred at 0.

This scatterplot shows the relationship between the residuals of the reduced model and the actual prices of the dataset. We are still seeing a roughly constant spread of residuals in this plot, similarly to the figure shown in the full model residual variance plot. This plot has barely changed, with most of the prices packed in the under-150 dollar price range.

Critique and Limitations

The overall approach adequately fitted the model with all the relevant features and variables from our dataset. It also adequately removed the variables with a p-value over 0.05 from the fully fitted model, performing quite well.

A limitation of our approach may lie in the way we preprocessed our data, which could have caused the vertical stripe patterns in the plots of the features from the regression model. Such a pattern may have come from removing instances of data with certain values within a specific feature, thus making certain prices associated with those limited values more likely, along those vertical lines.

Also the way we preprocessed the data during the first phase, namely the way outliers were treated, may have positively affected the residuals plotted and shown through our statistical modelling approach, as observed residuals from the regression model maintained a degree of constant variability.

Moreover, the limited sample size in relation to the great number of variables we had after one-hot encoding our 8 features may have contributed to a more diluted, low-resolution prediction. Thus, the reduced sample size of the dataset used in the statistical modelling may have reduced its predictive potential, especially given the highly granular nature of one of our features (neighbourhood).

Furthermore, given the nature of our dataset (Airbnb listings in New York), spatial variables could have been used in addition to purely categorical features. This may have provided greater predictive potential than relying on highly granular categorical features alone, an approach we avoided because of its complexity.

Summary and Conclusions

Project Summary

We began by removing features of our dataset that don’t aid predictive modelling. Then we cleaned it for missing values and certain columns for outliers, carefully considering decisions that would not be detrimental to the predictive quality of our model, in case columns like price with large outliers are necessary for maintaining the accuracy of our model. We then fitted the cleaned data to an OLS regression model and checked for whether that data should drop the neighbourhoods feature or include it based on its r-squared value. We doubted the safety of using a column with hundreds of categorical values, though we ended up keeping it for its higher resultant r-squared value. We then checked for the assumptions of a multiple linear regression model by testing linearity, residual normality, and constant variance. We couldn’t test for independence but we had justification for why our observations were independent. We then performed backward feature selection by using code from the sample template and checked for residuals of the data. We then checked conditions for our reduced linear model and noted the changes.

Summary of Findings

The regression model based on our preprocessed data showed support for the regression assumptions, except for linearity which didn't necessarily have a clear structure, though had some linear patterns. What our findings showed, in regards to hotel prices, is that most hotel prices were below 200, and it was in this region that great normality was observed in residuals. After the 200 dollar mark, it was observed that the histogram plot had skewed more towards the right side, appearing to be more stretched, showing that there was greater variability after the 200 dollar mark.

After reducing the model through the backward feature selection process, we ended up with 41 features for our model. The graphs of the residuals from the full and reduced models showed no visible difference, indicating that the removed features contributed little to prediction. The removed features were dropped because it was believed that they did not have enough data backing them up, due to our limited 5000-row sample size and the nature of the dataset.

Moreover, the R-squared and adjusted R-squared values for the full model were 0.541 and 0.524 respectively, whereas for the reduced model they were 0.529 and 0.525 respectively. The adjusted R-squared value increased by only 0.001, meaning that the removal of 145 features did not have much influence on the model's predictive power, which supports our observation that the dataset did not have enough data backing the multitude of features produced by one-hot encoding.
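
For reference, these two quantities are linked by the standard adjusted R-squared formula, where n is the number of observations and p the number of predictors:

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
```

Because the penalty factor shrinks as p falls, removing many weak predictors can raise the adjusted value slightly even while the raw R-squared drops, which matches the 0.524 to 0.525 change reported above.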

Conclusions

We have found that our New York Airbnb listings dataset includes features that significantly impact prices as well as features that have only a minor effect on our response variable. Our objective was to find the factors that affected the price of Airbnb listings the most and to use them to better predict the prices of listings based on features of our dataset. In addition, our predictive modelling was intended to help predict prices of Airbnb listings for many use cases, such as aiding banking firms, helping market researchers, and improving price accuracy for consumers and Airbnb hosts alike. Overall, based on the R-squared value, our predictive model was moderately accurate at predicting the prices of Airbnb listings.

References
